Flower, dog, anxious, senior, car, Item, president, worried, avocado, Zendaya, licorice, Nerdfighter, toothbrush, zany, expedient. This isn’t really a vlogbrothers video. It’s just a random string of words. There aren’t any coherent sentences. It looks like John-Green-bot could use some help speaking a bit more like human John Green - sounds like an excellent task for Natural Language Processing. INTRO Hey, I’m Jabril and welcome to Crash Course AI! Today, we’re going to tackle another hands-on lab. Our goal today is to get John-Green-bot to produce language that sounds like human John Green… and have some fun while doing it. We’ll be writing all of our code using a language called Python in a tool called Google Colaboratory, and as you watch this video, you can follow along with the code in your browser from the link we put in the description. In these Colaboratory files, there’s some regular text explaining what we’re trying to do, and pieces of code that you can run by pushing the play button. Now, these pieces of code build on each other, so keep in mind that we have to run them in order from top to bottom, otherwise we might get an error. To actually run the code and experiment with changing it, you’ll have to either click “open in playground” at the top of the page or open the File menu and click “Save a Copy to Drive”. And just an FYI: you’ll need a Google account for this. Now, we’re going to build an AI model that plays a clever game of fill-in-the-blank. We’ll be able to give John-Green-bot any word prompt like “good morning,” and he’ll be able to finish the sentence. Like any AI, John-Green-bot won’t really understand anything, but AI generally does a really good job of finding and copying patterns. When we teach any AI system to understand and produce language, we’re really asking it to find and copy patterns in some behavior. So to build a natural language processing AI, we need to do four things: First, gather and clean the data.
Second, set up the model. Third, train the model. And fourth, make predictions. So let’s start with the first step: gather and clean the data. In this case, the data are lots of examples of human John Green talking, and thankfully, he’s talked a lot online. We need some way to process his speech. And how can we do that? Subtitles. Conveniently, there’s a whole database of subtitle files on the Nerdfighteria wiki that I pulled from. I went ahead and collected a bunch and put them into one big file that’s hosted on Crash Course AI’s GitHub. This first bit of code in 1.1 loads it. So if you wanted to try to make your AI sound like someone else, like Michael from Vsauce, or me, this is where you’d load all that text instead. Data gathering is often the hardest and slowest part of any machine learning project, but in this instance it’s pretty straightforward. Still, we aren’t done yet: now we need to clean and prep our data for our model. This is called preprocessing. Remember, a computer can only process data as numbers, so we need to split our sentences into words, and then convert our words into numbers. When we’re building a natural language processing program, the term “word” may not capture everything we need to know. How many instances there are of a word can also be useful. So instead, we’ll use the terms lexical type and lexical token. A lexical type is a word, and a lexical token is a specific instance of a word, including any repeats. So, for example, in the sentence “The goal of machine learning is to make a learning machine,” we have eleven lexical tokens but only nine lexical types, because “learning” and “machine” both occur twice. In natural language processing, tokenization is the process of splitting a sentence into a list of lexical tokens. In English, we put spaces between words, so let’s start by slicing up the sentence at the spaces. “Good morning Hank, it’s Tuesday.” would turn into a list like this, and we would have five tokens.
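As a quick sketch of that space-splitting step (using the example sentence from the video):

```python
# Naive tokenization: slice the sentence up wherever there's a space.
sentence = "Good morning Hank, it's Tuesday."
tokens = sentence.split()

print(tokens)       # ['Good', 'morning', 'Hank,', "it's", 'Tuesday.']
print(len(tokens))  # 5
```

Notice the punctuation is still stuck to “Hank,” and “Tuesday.” — that’s exactly the problem the extra rules for punctuation deal with.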
However, there are a few problems. Something tells me we don’t really want a lexical type for Hank-comma and Tuesday-period, so let’s add some extra rules for punctuation. Thankfully, there are prewritten libraries for this. Using one of those, the list would look something like this. In this case we would have eight tokens instead of five, and tokenization even helped split up our contraction “it’s” into “it” and “apostrophe-s.” Looking back at our code, before tokenization, we had over 30,000 lexical types. This code also splits our data into a training dataset and a validation dataset. We want to make sure the model learns from the training data, but we can test it on new data it’s never seen before. That’s what the validation dataset is for. We can count up our lexical types and lexical tokens with this bit of code in box 1.3. And it looks like we actually have about 23,000 unique lexical types. But remember, how many instances there are of a word can also be useful. This code block here at step 1.4 lets us see how many lexical types occur more than once, twice, and so on. It looks like we’ve got a lot of rare words -- almost 10,000 words occur only once! Having rare words is really tricky for AI systems, because they’re trying to find and copy patterns, so they need lots of examples of how to use each word. Oh, human John Green. You master of prose. Let’s see what weird words you use. Pisgah? What even is a lilliputian? Some of these are pretty tricky and are going to be too hard for John-Green-bot’s AI to learn with just this dataset. But others seem doable if we take advantage of morphology. Morphology is the way a word gets shape-shifted to match a tense, like how you’d add an “ED” to make something past tense, or when you shorten or combine words to make them totes-amazeballs. Dear viewers, I did not write that in the script. In English, we can remove a lot of extra word endings, like ED, ING, or LY, through a process called stemming.
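Circling back to the tokenizer for a second: in practice you’d use one of those prewritten libraries (NLTK’s `word_tokenize` is a common choice), but a toy version of the punctuation rules can be sketched with a single regular expression:

```python
import re

def tokenize(sentence):
    # Keep contractions like "'s" as their own token, keep whole words,
    # and split every other punctuation mark off by itself.
    return re.findall(r"'\w+|\w+|[^\w\s]", sentence)

tokens = tokenize("Good morning Hank, it's Tuesday.")
print(tokens)
# ['Good', 'morning', 'Hank', ',', 'it', "'s", 'Tuesday', '.']
print(len(tokens))  # 8 tokens instead of 5
```

A real tokenizer handles many more edge cases (abbreviations, hyphens, quotes), but the idea is the same: a handful of rules about where words begin and end.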
And so, with a few simple rules, we can clean up our data even more. I’m also going to simplify the data by replacing numbers with the hashtag, or pound sign, whatever you want to call it. This should take care of a lot of rare words. Now we have 3,000 fewer lexical types, and only about 8,000 words occur just once. We really need multiple examples of each word for our AI to learn patterns reliably, so we’ll simplify even more by replacing each of those 8,000 or so rare lexical tokens with the word ‘unk’, for unknown. Basically, we don’t want John-Green-bot to get embarrassed if he sees a word he doesn’t know. So by hiding some words, we can teach John-Green-bot how to keep writing when he bumps into a one-time made-up word like zombicorns. And just to satisfy my curiosity… Yeah, John-Green-bot doesn’t need words like “whippersnappers” or “zombification”. John, what’s up with the fixation on zombies? Anyway, we’ll be fine without them. Now that we finally have our data all cleaned and put together, we’re done with preprocessing and can move on to Step 2: setting up the model for John-Green-bot. There are a couple key things that we need to do. First, we need to convert the sentences into lists of numbers. We want one number for every lexical type, so we’ll build a dictionary that assigns every word in our vocabulary a number. Second, unlike us, the model can read a bunch of words at the same time, and we want to take advantage of that to help John-Green-bot learn quickly. So we’re going to split our data into pieces called batches. Here, we’re telling the model to read 20 sequences (which have 35 words each) at the same time! Alright! Now, it’s time to finally build our AI. We’re going to program John-Green-bot with a simple language model that takes in a few words and tries to complete the rest of the sentence. So we’ll need two key parts: an embedding matrix, and a recurrent neural network, or RNN.
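Those cleanup steps, plus the word-to-number dictionary, can be sketched in a few lines. The suffix list, the ‘#’ placeholder, and the ‘<unk>’ spelling here are illustrative choices, not the exact ones in the Colab notebook:

```python
import re
from collections import Counter

def stem(word):
    # Crude stemming: strip a few common endings off longer words.
    for suffix in ("ing", "ed", "ly"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(tokens):
    # Lowercase, stem, and replace every digit with '#'.
    tokens = [re.sub(r"\d", "#", stem(t.lower())) for t in tokens]
    # Replace words that occur only once with an unknown placeholder.
    counts = Counter(tokens)
    return [t if counts[t] > 1 else "<unk>" for t in tokens]

tokens = preprocess("the learning machine is learning the machine task 42".split())
print(tokens)
# ['the', 'learn', 'machine', '<unk>', 'learn', 'the', 'machine', '<unk>', '<unk>']

# The dictionary from Step 2: assign every remaining lexical type a number.
word2idx = {w: i for i, w in enumerate(sorted(set(tokens)))}
ids = [word2idx[t] for t in tokens]
print(ids)  # [3, 1, 2, 0, 1, 3, 2, 0, 0]
```

From here, batching is just reshaping that long stream of ids into chunks, like the 20 sequences of 35 words the notebook uses.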
Just like we discussed in the Natural Language Processing video last week, this is an “Encoder-Decoder” framework. So let’s take it apart. An embedding matrix is a big list of vectors, which is basically a big table of numbers, where each row corresponds to a different word. These vector-rows capture how related two words are. So if two words are used in similar ways, then the numbers in their vectors should be similar. But to start, we don’t know anything about the words, so we just assign every word a vector with random numbers. Remember, we replaced all the words with numbers in our training data, so now when the system reads in a number, it just looks up that row in the table and uses the corresponding vector as an input. Part 1 is done: words become indices, which become vectors, and our embedding matrix is ready to use. Now, we need a model that can use those vectors intelligently. This is where the RNN comes in. We talked about the structure of a recurrent neural network in our last video too, but it’s basically a model that slowly builds a hidden representation by incorporating one new word at a time. Depending on the task, the RNN will combine new knowledge in different ways. With John-Green-bot, we’re training our RNN with sequences of words from Vlogbrothers scripts. Ultimately, our AI is trying to build a good summary to make sure a sentence has some overall meaning, and it’s keeping track of the last word to produce a sentence that sounds like English. The RNN’s output after reading the most recent word in a sentence is what we’ll use to predict the next word. And this is what we’ll use to train John-Green-bot’s AI after we build it. All of this is wrapped up in code block 2.3. So Part 2 is done: we’ve got our embedding matrix and our RNN. Now, we’re ready for Step 3: train our model. Remember when we split the data into pieces called batches? And remember earlier in Crash Course AI when we used backpropagation to train neural networks?
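Before we train it, here’s what those two parts boil down to, sketched from scratch with NumPy. The sizes and random weights are made up for illustration; the notebook builds the same pieces with a deep learning library:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, hidden_dim = 1000, 16, 32

# The embedding matrix: one random vector-row per word, to start.
embedding = rng.normal(size=(vocab_size, embed_dim))

# The RNN's weights: they mix each new word's vector into the running summary.
W_xh = rng.normal(size=(embed_dim, hidden_dim)) * 0.1
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
W_hy = rng.normal(size=(hidden_dim, vocab_size)) * 0.1

def rnn_step(word_id, h):
    x = embedding[word_id]            # word -> index -> vector
    h = np.tanh(x @ W_xh + h @ W_hh)  # fold it into the hidden representation
    logits = h @ W_hy                 # a score for every possible next word
    return h, logits

# "Read" a short sequence of word ids, one word at a time.
h = np.zeros(hidden_dim)
for word_id in [3, 17, 42]:
    h, logits = rnn_step(word_id, h)

print(logits.shape)  # (1000,) -- the scores we'd use to predict the next word
```

Training adjusts the embedding rows and the three weight matrices so that, after reading a real sequence, the highest score lands on the word that actually came next.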
Well, we can put those pieces together, iterate over our dataset, and run backpropagation on each example to train the model’s weights. In step 3.1 we define how to train our model, in step 3.2 we define how to evaluate it, and in step 3.3 we actually create our model, which means training and evaluating it. Over the span of one epoch of training this model, the network will loop over every batch of data -- reading it in, building representations, predicting the next word, and then updating its guesses. This will train over 10 epochs, which might take a couple minutes. We’re printing two numbers with each epoch, which are the model’s training and validation perplexities. As the model learns, it realizes there are fewer and fewer good choices for the next word. The perplexity is a measure of how well the model has narrowed down the choices. Okay, it looks like the model is done training and has a perplexity of about 45 on train and 72 on validation, but it started with perplexities in the hundreds! We can interpret perplexity as the average number of guesses the model makes before it predicts the right answer. After seeing the data once, the model needed over 300 guesses for the next word, but now it’s narrowed it down to fewer than 50. That’s a pretty good improvement, even though it’s far from perfect. Time to see what the model can write, but to do that, we need one final ingredient. So far in Crash Course AI, we’ve talked a lot about the one best label or the one best prediction an AI can make, but that doesn’t always make sense for certain problems. If you wrote stories by always having characters do the next obvious thing, they’d be pretty boring. So Step 4 is inference, the part of AI where the machine gets to make some choices, and where we can still help it a little bit. Let’s think about what the final layer of the RNN is actually doing.
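One aside before we do: perplexity can be computed directly from the probabilities the model assigned to the correct next words. It’s the exponential of the average negative log-probability (the probabilities here are made up):

```python
import math

# Probabilities a model assigned to each actual next word in some text.
probs = [0.1, 0.02, 0.5, 0.05]

# Perplexity = exp(average negative log-probability).
perplexity = math.exp(sum(-math.log(p) for p in probs) / len(probs))
print(round(perplexity, 1))  # 11.9 -- roughly "12 guesses per word"

# Sanity check: blind guessing over 300 equally likely words
# gives a perplexity of exactly 300.
assert math.isclose(math.exp(-math.log(1 / 300)), 300)
```

That sanity check is why “perplexity in the hundreds” at the start of training means the model was essentially guessing blindly across hundreds of words.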
We talk about that final layer like it’s outputting a single label or prediction, but actually the network is producing a bunch of scores or probabilities. The most likely word has the highest probability, the next most likely word has the second highest probability, and so on. Because we get probabilities at every step, instead of taking the best one each time to produce one sentence, we could sample 3 words and start 3 new sentences. Each of those 3 sentences could then start 3 more new sentences… and then we have a branching diagram of possibilities. Inference is so important because what the model can produce and what we want aren’t necessarily the same thing. What we want is a really good sentence, but the model can only tell us the score for one word at a time. Let’s look at this branching diagram. Whenever we choose a word, we create a new branch and keep track of its score or probability. If we multiply each score through to the end of the branch, we see that the top branch made the best-scoring first choice, but a worse sentence overall. So we’re going to implement a basic sampler in our program. This will take a bunch of random paths, so we can sort the results by the probability of the full sentences and see which sentences are best overall. Also, when asking John-Green-bot to generate all these sentences, we need to give him a word to start. I’m going to try “Good” for now, but you can try other things by changing the code in 4.1. Remember the preprocessing we did on our data? That’s why these sentences look a little off, with hashtags for numbers, and the space before word endings that we introduced when stemming. And look at the sentence you get from taking the highest probability word each time: “Good morning Hank, it’s Tuesday. I’m going to be like, I’m going to be like, I’m going to be like, I’m going to see.” It isn’t as interesting as the ones where we mixed it up a bit and took different branches.
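A miniature version of that sampler, with a made-up table of next-word probabilities standing in for the trained model:

```python
import random

# A made-up stand-in for the model: for each word, the possible
# next words and their probabilities.
next_word_probs = {
    "good":    [("morning", 0.5), ("luck", 0.3), ("grief", 0.2)],
    "morning": [("hank", 0.7), ("everyone", 0.3)],
    "luck":    [("hank", 0.5), ("today", 0.5)],
    "grief":   [("hank", 1.0)],
}

def sample_sentence(start):
    words, score = [start], 1.0
    while words[-1] in next_word_probs:
        options = next_word_probs[words[-1]]
        choices, weights = zip(*options)
        word = random.choices(choices, weights=weights)[0]
        words.append(word)
        score *= dict(options)[word]  # multiply each score through the branch
    return " ".join(words), score

# Take a bunch of random paths, then sort by full-sentence probability.
random.seed(0)
samples = {sample_sentence("good") for _ in range(20)}
for sentence, score in sorted(samples, key=lambda s: -s[1]):
    print(f"{score:.2f}  {sentence}")
```

Sorting the finished branches by their multiplied-through probability is what lets us spot a sentence that started with a lower-scoring word but ended up better overall.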
To be honest though… none of these are great Vlogbrothers scripts. That’s because of two important things. First, there’s our data. Remember, we didn’t have many examples of how to use each word. In fact, we had to cut out a lot of “rare words” during training because they only showed up once, so we couldn’t teach John-Green-bot to recognize any patterns related to them. Lots of state-of-the-art models address this by downloading data from Wikipedia, large collections of books, or even Reddit when they train their models. We’ll include some links in the description if you want to play with some fancier models. But the second, bigger issue is that AI models are missing the understanding we have as humans. Even if John-Green-bot split up words perfectly and predicted sentences that sound like English, it’s still John-Green-bot using tools like tokenization, an embedding matrix, and a simple language model to predict the next word. When human John Green writes, he uses his understanding of the world; like in Vlogbrothers videos, he considers Hank’s perspective, or whoever’s watching. He’s not just trying to predict which next word has the highest probability. Building models that interact with people, and the world, is why natural language processing is so exciting, but it’s also why it’ll take a lot more work to get John-Green-bot to generate language as well as human John Green does. We’ve left a bunch of notes in the code for you to play with if you want to make your own AI. You can train for longer, change the sentence prompt, or, if you’re feeling adventurous, replace the text data to speak in someone else’s voice. If you end up using this to make something cool, let us know in the comments. Thanks for watching, and see you next week. PBS Digital Studios wants to hear from you. We do a survey every year that asks what you’re into, your favorite PBS shows, and things you would like to see more of from PBS Digital Studios. You even get to vote on potential new shows.
All of this helps us make more stuff that you want to see. The survey takes about 10 minutes, and you might win a sweet t-shirt. The link is in the description. Thanks. Crash Course AI is produced in association with PBS Digital Studios! If you want to help keep all of Crash Course free for everybody, forever, you can join our community on Patreon. And if you want to learn more about NLP, check out this video from Crash Course Computer Science.